1 Introduction and data used

In this study, we present the results of analysing the mis-splicing noise across different samples before and after the shRNA knockdown of a specific target gene, responsible for the production of different RNA binding proteins (RBPs).

More precisely, from the 356 RBPs studied by Van Nostrand et al. in 2020 in his publication A Large-Scale Binding and Functional Map of Human RNA Binding Proteins, 115 of them were functionally categorized under the following categories: splicing regulation, spliceosome or exon junction complex. However, only 56 of those were also profiled by ENCODE fulfilling the following requirements:

  • Need for at least 4 experiments:
    • 1 case experiment for cell line K562.
    • 1 case experiment for cell line HepG2.
    • 1 control experiment for cell line K562.
    • 1 control experiment for cell line HepG2.
  • Need for 2 different samples per cell line and cluster. That is, 2 different isogenic replicates per experiment.

In total, we required 8 samples per studied RBP, divided in 4 cases and 4 controls which, at the same time, are divided in 4 cell line HepG2 and 4 cell line K562.

Across this document, we will refer to each possible target gene as a project and each sample type (i.e. control or case) as clusters. That is, we will study 56 different projects, each one consisting of two clusters with 4 samples each.

1.1 Execution parameters

To properly execute this .Rmd, we need to define some variables:

  • Where is the data?

  • Load the metadata:

  • Output files path:

1.2 Metadata extraction

Using the ENCODE REST API, we automated the metadata extraction of the different experiments. By setting a series of filters in their search portal, we extracted a total of 170 different target genes of the shRNA knockdown with the previous requirements. As mentioned, only 56 of those were also studied by Van Nostrand et al. in his publication.

The relevant metadata of the RBPs studied in this document are the following:

The study of the metadata suggest that all ENCODE experiments were performed identically, and that the samples were extracted from the same two cell lines: K562 and HepG2. As such, all results can be compared between against each other.

2 Methods

2.1 Analysis pipeline

For each RBP, the process is split in several steps:

  1. Download and extraction of .bam files: from the ENCODE Experiment search portal, the files related to each RBP are automatically downloaded, both case and control samples.

  2. Extraction of the junctions: from the .bam files, all junctions found in samples are grouped together, named and their reads are counted. The generated file is all_reads_combined.rds

  3. Junction annotation: using the junction_annot() function from the package dasper and the reference transcriptome v105, we identified each junction and classified them as novel_donor, novel_acceptor or annotated. In this step, we also removed the junctions within the ENCODE blacklisted regions v2 and calculated the MaxEntScan. Two other filters are applied: the removal of reads smaller than 25bp and those annotated introns that are ambiguously assigned to more than one gene. The generated file is annotated_SR_details.rds

The following steps were executed on each available cluster (case or control) independently:

  1. Generation of the raw distances: by looking for overlaps between the novel junctions and the annotated junctions, we measured the distances in bp (base pairs) between the novel and the reference splice sites. We used this function to associate an annotated intron to a novel junction. This step was executed independently per sample. The generated file is cluster_distances_raw.rds

  2. Filtering the distances: combining all the previous calculated distances by sample, we removed the novel junctions associated to two or more reference introns. The generated file is cluster_distances_tidy.rds

  3. Finding the never mis-spliced junctions: using the raw distances data, we selected the reference introns that are not associated to any novel junction. Even if the association was discarded because of ambiguous junctions, we removed the reference intron from this list. The generated file is cluster_distances_tidy_all.rds

  4. Generation of the DB: with all the previous information, two main tables were created: db_introns and db_novel. Each one contains the relevant information related to reference introns and novel junctions. This includes the calculation of the percentage of protein-coding transcripts in which each annotated intron was found. The generated files are cluster_db_introns.rds and cluster_db_novel.rds.

2.2 Finding the common introns

From the generated DBs in every project, we first generated two different tables:

  • Common annotated intron table: we looped through every db_introns table and extracted only the information from the common annotated introns to all projects and clusters. To identify common annotated introns, we used their locus (i.e. seqname:start-end:strand), since it is a unique identifier. The goal was to have the same number of annotated introns per project.

  • Common novel junction table: we looped through every db_novel table and extracted only the information from the novel junctions associated to common annotated introns. Thus, we first needed to calculate the common annotated intron table. Unlike before, we can have a different number of novel junctions per project or cluster.

3 Studied metrics

3.1 MSR Studies

For each annotated intron, two Mis-Splicing Ratios (\(MSR\)) are calculated to provide a measurement of the mis-splicing frequency at the donor site (\(MSR_D\)) and acceptor site (\(MSR_A\)). To calculate these measurements, we first sum all of the novel donor/acceptor junction read counts and then divide by the sum of all annotated intron and novel junction read counts across the specific samples. It follows this formula:

\[ \begin{equation} MSR_A = \frac{\sum_{i=1}^{N}j_i}{\sum_{i=1}^{N}j_i+\sum_{i=1}^Ns_i } \end{equation} \] where \(j\) is the number of novel acceptor junction reads for a particular annotated intron, \(s\) is the number of annotated intron reads and \(N\) is the number of samples being studied. We can generate an \(MSR\) table in which each row corresponds to an annotated intron and each column to a cluster. Example of the generated \(MSR_A\) tables:

ref_junID ADAR_case ADAR_control AQR_case AQR_control BUD13_case BUD13_control
8404 0.024 0.000 0.022 0.034 0.008 0.014
12732 0.048 0.152 0.210 0.137 0.040 0.032
16759 0.000 0.004 0.003 0.000 0.000 0.032
17015 0.070 0.035 0.039 0.060 0.006 0.023
17681 0.321 0.171 0.106 0.182 0.221 0.262
19089 0.013 0.038 0.077 0.000 0.006 0.008
19097 0.000 0.000 0.000 0.000 0.000 0.000
20250 0.000 0.000 0.000 0.004 0.000 0.000
32980 0.000 0.007 0.105 0.003 0.005 0.006
32996 0.000 0.000 0.000 0.000 0.000 0.020

Once we have calculated the \(MSR_A\) and \(MSR_D\) for every annotated intron and every project, we use the paired Wilcoxon signed rank test to study if there is a significant variation in the median \(MSR\) in cases vs. controls. To be more precise:

  • Null hypothesis (H0): the observations \(MSR_{case} - MSR_{control}\) are symmetric about \(\mu\) = 0.
  • Alternative hypothesis (H1): the observations \(MSR_{case} - MSR_{control}\) are symmetric about \(\mu\) > 0 (i.e. the distribution of \(MSR_{case} - MSR_{control}\) is greater than 0).

For every project, we will have two p-values (one for each splice site) to reject or not the null hypothesis in favor of the alternative hypothesis.

3.2 Annotated intron studies

For every project, we define the MSP as the number of mis-spliced annotated introns divided by the total number of annotated introns (i.e. the percentage of annotated introns that are found to be misspliced). This metric is calculated for both the control and case samples.

\[ \begin{equation} MSP = \frac{\text{# Mis-spliced annotated introns}}{\text{# Annotated introns}} \end{equation} \]

To do so, we first count the number of annotated introns for each project (this number is the same across all projects) and the number of mis-spliced annotated introns in cases and control. We calculate the MSP for each cluster, and extract the difference as \(MSP_{case} - MSP_{control}\).

3.3 Novel junctions studies

There are two important statistics that we will study related to novel junctions: the number of unique novel junctions and the number of reads associated to novel junctions.

3.3.1 Number of unique novel junctions

For each cluster, we will measure the number of unique novel junctions associated to the common annotated introns. This measurement is calculated for the different splice sites.

We study the variation in the number of unique novel junctions for each project by dividing the number in the case samples by the number in the control sample.

\[ \begin{equation} \text{Variation in unique novel junctions [%]} = \left(\frac{\text{# Unique novel junction}_{Cerebellum}}{\text{# Unique novel junctions}_{Frontal}}-1\right)*100\% \end{equation} \]

3.3.2 Novel junctions reads

To measure the percentage of novel reads, we sum all reads of the novel junctions in a given cluster and divide it by the total number of reads in the cluster. We obtain the percentage of reads that correspond to novel junctions in the cluster.

\[ \text{Percentage of novel reads [%]}=\frac{\text{# Novel junction reads}}{\text{# Total reads}} *100\% \]

As before, we will divide the case percentage by the control percentage to study the variation between case and control.

\[ \text{Variation in percentage of novel reads [%]}=\left(\frac{\text{Percentage of novel reads}_{Cerebellum}}{\text{Percenrage of novel reads}_{Frontal}}-1\right)*100\% \]

4 Results

4.1 MSR Studies

From the results of the paired Wilcoxon ranked test, we keep only those projects in which the null hypothesis can be rejected (i.e. p-value after Bonferroni correction \(<= 0.05\)). The following tables shows the results for:

  • Donor splice site:
  • Acceptor splice site:

From the previous tables, the relevant parameter is the effect size of the Wilcoxon paired rank test. It is calculated using the Z statistic and the number of annotated introns studied, but it can be directly calculated using the function rstatix::wilcox_effsize(). Using the interpretation found here, we categorize the effect size in: small, moderate and large. We can graphically represent the effect values for all projects where the null hypothesis of the Wilcoxon signed rank test is rejected.

The X-axis represents the effect size while the Y-axis shows the results for each individual in which the median \(MSR\) difference was significant. Positive effect sizes represent that the median \(MSR\) in the case samples is higher than for the control samples. The colour is used for the splice site, being blue for the acceptor site and green for the donor site.

4.2 Percentage of misspliced annotated introns

If we study now the variation in the percentage of misspliced annotated introns, we obtain the following table:

We can also represent the results in the following graph, where the X-axis shows the difference column from the previous table. The Y-axis represents each target gene, divided by functional category.

4.3 Number of unique novel junctions

The results of the variation in the number of unique novel junctions are found in the following table:

Results are also shown in the following graph, where the X-axis shows the variation column from the previous table. The Y-axis represents each target gene, divided by cluster (cases and control) and splice site (donor and acceptor).

IMPORTANT: The novel acceptor value for target gene AQR cannot be represented in the same axis as the other target genes. The value goes to 429% increase.

4.4 Number of reads of novel junctions

The results of the variation in the percentage of novel reads are found in the following table:

Results are also shown in the following graph, where the X-axis shows the variation column from the previous table. The Y-axis represents each target gene, divided by fuctional category and splice site (donor and acceptor).